† Corresponding author. E-mail:
Project supported by the National Key R&D Program of China (Grant No. 2016YFB0700503), the National High Technology Research and Development Program of China (Grant No. 2015AA03420), Beijing Municipal Science and Technology Project, China (Grant No. D161100002416001), the National Natural Science Foundation of China (Grant No. 51172018), and Kennametal Inc.
Since its launch in 2011, the Materials Genome Initiative (MGI) has drawn the attention of researchers from academia, government, and industry worldwide. As one of the three tools of the MGI, the use of materials data, for the first time, has emerged as an extremely significant approach in materials discovery. Data science has been applied in different disciplines as an interdisciplinary field to extract knowledge from data. The concept of materials data science has been utilized to demonstrate its application in materials science. To explore its potential as an active research branch in the big data era, a three-tier system has been put forward to define the infrastructure for the classification, curation and knowledge extraction of materials data.
As the practice of obtaining information and insight from data, data science has become a very familiar term to researchers from various disciplines.[1] The concept was first introduced in the 1960s and remained in use for a few decades. In 1996, the statistician C. F. Jeff Wu used the term again, describing the discipline as an extension of statistics. Nowadays, huge amounts of scientific data are produced by simulations, high-throughput scientific instruments, satellites, telescopes, and so on. The availability of big data is revolutionizing how research is conducted and has led to the emergence of a new paradigm in science based on data-intensive computing and analytics. Data science is defined as the fourth paradigm of data-intensive scientific discovery, alongside experimentation, theory and calculation.[2] Furthermore, the release of the Big Data R&D Initiative in 2012 has accelerated the development of data science.
Data science has been applied in diverse disciplines in recent years. An integrated data science pipeline was used to identify latent signals of QT-interval-prolonging drug-drug interactions (QT-DDIs) from electrocardiogram data in electronic health records.[3] A data-driven approach was taken to simulate human mobility, provide spatial models, and predict the nationwide consequences of mass switching to electric vehicles.[4] To address the challenges raised by the rapid development of big data, new journals have been launched in addition to the existing Data Science Journal, including the Journal of Data Science and Analytics (JDSA)[5] and Data Science and Engineering (DSE),[6] to stimulate scientific innovation and practice in data management and data-intensive applications.
In this study, we introduce a data ecosystem for materials data and aim to organize it into a coherent portrait of the scientific study of materials data, whose elements are related to each other and to the materials science and engineering disciplines. The materials data ecosystem comprises data sources and data science. Materials data sources are diverse (publications, records from facilities and computation tools, third-party data, and so on) and are not covered in detail in this text. Materials data science, inherently cross-functional and at the very highest level of data study, is investigated here.
In 1999, John R. Rodgers introduced the concept of materials informatics and defined it as an effective data management tool for new materials discoveries. Somewhat later, Integrated Computational Materials Engineering (ICME, 2008) and the Materials Genome Initiative (MGI, 2011) drew worldwide attention to the integration of computational capabilities, data management, and experimental techniques.[7] Materials databases offering universal access to abundant scientific data had already been built in many countries, and some had become fundamental to materials computation. Even so, it was only then that materials data and materials informatics[8] were first recognized as standing alongside computation and experimentation in materials innovation. The concept of a materials data infrastructure was put forward based on the integration of ICME and materials informatics;[9] however, the diversity of materials science has yet to be reflected in it. Therefore, a system needs to be built that can both represent real-world materials details virtually and support data mining.
Materials data science is data science applied in materials science and engineering, aiming to take advantage of data to discover the science beneath the observed phenomena and production. It is a new, inherently cross-disciplinary approach. Currently, ever more demanding market requirements make it necessary to understand the whole production chain of materials. Therefore, materials researchers strive to build a link between the five core points: chemical composition, microstructure, manufacturing, properties, and performance in service.
Knowledge engineering connects data with information, knowledge, and intelligence from a conceptual perspective. When combined with scientific disciplines, data are endowed with relevant meaning and the complicated correlations among data can be explained. To promote data science into a sub-field in materials science and engineering, a three-tier infrastructure of materials data science was introduced, as shown in Fig.
A single material datum represents an attribute characteristic of a specific material, and its applicability may be limited in scope by the material's nature and inherent correlations. Materials can be classified in several ways, each with its pros and cons, and the same is true of materials data. The materials data system adopted here is borrowed from the materials scientific data sharing network[10] and consists of 11 first-level categories, shown in Fig.
The categories can be further classified into more sub-layers according to the classification system of each individual material. Furthermore, the system is also feasible as a basis for knowledge to construct a metadata system for the materials.
Data, which are the product in the data era, exhibit common characteristics of real commodities and experience a similar development process. The life cycle of data involves production, storage, updates, management, publication, application, and finally deletion or long-term storage for re-use. The comparison of the life cycle process, shown in Fig.
The description of scientific data is associated with storage and presentation activities. Based on the materials data system, as well as data sources from computation, experimentation, characterization and industry, we are drafting a standard for materials scientific data description, whereby the materials data are divided into three groups: experimental data, computational data, and production data. The experimental data are further divided into two sub-groups: experimental data on bulk materials, and data on specific subjects such as coatings and corrosion.
For different groups and sub-groups, the attributes collected in the databases differ because of how the data are generated and used. Academic activities require comprehensive information covering everything from the quantum scale to the macro scale, as described in ICME. As the MGI describes, the goals of halving time and cost will be fulfilled when efforts focus on an innovative re-use of the data. To meet the requirements of innovative materials discovery, it is mandatory to add a detailed description of the material production process for experimental data and of the prerequisites for computational data. In the past, owing to the limited approaches available in materials research, most information about data generation processes was omitted, and the re-use of data was restricted solely to questions of materials' performance.
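The grouping and the completeness requirement described above can be illustrated with a minimal sketch. The group names follow the text, but the attribute names in `REQUIRED_ATTRIBUTES` and the `MaterialsRecord` class are hypothetical placeholders chosen for illustration; they are not taken from the draft standard.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataGroup(Enum):
    """Groups and sub-groups of materials data named in the text."""
    EXPERIMENTAL_BULK = "experimental (bulk materials)"
    EXPERIMENTAL_SPECIFIC = "experimental (specific subject)"
    COMPUTATIONAL = "computational"
    PRODUCTION = "production"


# Hypothetical required attributes per group (assumption, for illustration only).
REQUIRED_ATTRIBUTES = {
    DataGroup.EXPERIMENTAL_BULK: {"composition", "production_process", "property"},
    DataGroup.EXPERIMENTAL_SPECIFIC: {"composition", "production_process", "subject", "property"},
    DataGroup.COMPUTATIONAL: {"composition", "method", "prerequisites", "result"},
    DataGroup.PRODUCTION: {"composition", "process_parameters", "property"},
}


@dataclass
class MaterialsRecord:
    material: str
    group: DataGroup
    attributes: dict = field(default_factory=dict)

    def missing_attributes(self) -> set:
        """Return the required attribute names that this record lacks."""
        return REQUIRED_ATTRIBUTES[self.group] - set(self.attributes)

    def is_complete(self) -> bool:
        """A record is complete when no required attribute is missing."""
        return not self.missing_attributes()


# A computational record that omits its prerequisites is flagged as incomplete.
record = MaterialsRecord(
    material="example alloy",
    group=DataGroup.COMPUTATIONAL,
    attributes={"composition": "A-B", "method": "first-principles", "result": 1.0},
)
print(sorted(record.missing_attributes()))  # → ['prerequisites']
```

Checking completeness when a record is entered, rather than when it is re-used, is one way to keep the generation process from being lost.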
To emphasize the whole chain of materials production and optimization, the integrity of each item of data is especially significant. The key attributes for the three groups of materials data are listed in Table
The database, an organized collection of data, has been the typical way to define, store, update, and administer data whereby the data are accessible to query and retrieve. The inorganic crystal structure database (ICSD), the Pauling file, databases for thermodynamic computation, and so on, are those that are specifically used and associated with the software of first-principles calculation, thermodynamics, and properties simulations. Others are mostly about the properties obtained from past research and industrial activities.
The database is well developed for the curation of raw materials data, while data warehouses have appeared in recent years to store topic-specific and integrated data from one or more disparate sources for data mining. Besides databases and data warehouses, cloud storage provides a brand new choice. Currently, some supercomputing centers and companies use cloud computing to provide PaaS and IaaS services equipped with both hardware and software.
The cloud computing platform will definitely be utilized by more materials data researchers once the privacy and intellectual property issues in materials communities have been settled.
Therefore, databases, data warehouses, and cloud storage are three alternative candidates for materials researchers to optimize data storage of their own data resources.
In recent years, the conflict between data sharing and the protection of proprietary interests has become more evident and is turning into a global problem. Academia is one of the most significant sources of scientific data, yet data owners are often reluctant to share, or even refuse to share, the final outcomes and the supplemental information that is essential for understanding or reproducing the data. They cite intellectual property protection and potential commercial value, even though government funding agencies and scientific journals require such sharing. In materials science and related fields, some topics touch on national security, and sharing the associated data is prohibited. The difficulty of deciding where to draw these boundaries therefore impedes communication about data. Tim Austin[12] considered data citation as a possible solution.
The combination of two widely available methods already applied to papers, identification and citation, may be one feasible solution. The digital object identifier (DOI) has in recent years become a common worldwide means of protecting the intellectual property of published papers:[13] once a DOI is registered, the paper is uniquely labelled with a string of numbers and letters. Similarly, DOIs for scientific data were first applied to geographic data in China a few years ago,[14] and a formal and detailed format for registration and citation has now been established. Later, a DOI system for materials data was founded in China, based on work on the National Materials Scientific Data Sharing Network, shown in Fig.
Publication of scientific data is regarded as one of the means of data sharing, as well as of evaluating the contributions of data collectors. The DOI/CSTR provides information about the data as metadata for database management and for information querying and retrieval.[15] Sungbum Park et al. produced an information systems (IS) success model for evaluating the application of the DOI system, and indicated that both data content (including its features) and information quality are significant factors that influence organizational benefits through perceived usefulness and user satisfaction.[16] Accordingly, because the DOI is closely tied to how databases are used, a DOI should be assigned to materials data at the point when the data are collected and integrated into the databases. Unlike the field of human health data, which is creating a global coalition of data resources,[17] materials science does not yet have an internationally accessible data infrastructure; however, DOI is paving the way towards this goal.
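As an illustration of attaching an identifier to a dataset when it enters a database, the sketch below stores a DOI with a few descriptive fields and builds a citation string from them. The `DatasetMetadata` class, its field names, the citation layout, and the example DOI are all hypothetical; they do not reproduce the registration and citation format mentioned above, and the regular expression is only a simplified syntax check ("10.", a numeric registrant code, a slash, and a suffix).

```python
import re
from dataclasses import dataclass

# Simplified DOI syntax check (assumption): "10.<digits>/<suffix>".
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


@dataclass(frozen=True)
class DatasetMetadata:
    doi: str
    title: str
    creators: tuple
    publisher: str
    year: int

    def __post_init__(self):
        # Reject identifiers that do not look like a DOI at ingestion time.
        if not DOI_PATTERN.match(self.doi):
            raise ValueError(f"not a syntactically valid DOI: {self.doi!r}")

    def citation(self) -> str:
        """Build a human-readable citation (hypothetical layout)."""
        authors = "; ".join(self.creators)
        return (f"{authors} ({self.year}). {self.title}. "
                f"{self.publisher}. https://doi.org/{self.doi}")


meta = DatasetMetadata(
    doi="10.99999/example.dataset.0001",  # placeholder, not a registered DOI
    title="Example hardness measurements",
    creators=("A. Researcher", "B. Researcher"),
    publisher="Example Materials Database",
    year=2018,
)
print(meta.citation())
```

Validating the identifier when the record is created mirrors the recommendation that the DOI be assigned when the data are collected and integrated.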
Cross-scale modeling is an interesting topic following the exploitation of multiscale modeling.[18] Smart manufacturing of materials requires cross-scale modeling, simulation and control, by taking advantage of a combination of information and materials knowledge, where data are the fundamental elements and data transfer across scales is crucial. So the characteristics of big data in the smart manufacturing of materials are high dimension and complicated correlation rather than high volume.
Krishna Rajan pointed out that there is currently no unified way to explore patterns of behavior across correlative databases.[19] To bridge the gaps between the databases and cross-scale research activities, it is essential to understand the input and output for each scale, that is, the relevant attributes serving as prerequisites and boundary conditions for computation or experiment, and the results in a data format. A knowledge-based understanding of the exploitation process of powder metallurgy materials is shown in Fig.
Data science is regarded as the fourth paradigm of data-intensive scientific discovery, alongside experimentation, theory, and calculation.[2] In materials science and engineering, with the emergence of materials informatics, integrated computational materials engineering and the materials genome, materials data are going beyond collection and integration and entering a new stage of application for the exploration and discovery of new or alternative materials, which is the core of data-driven materials research and a further step ahead in materials design. Materials informatics is becoming a methodology for data mining and machine learning in materials science.[20]
According to its functionality, the research of data mining in materials science is divided into two categories, one for the creation of new materials based mainly on first-principles calculations, and the other for the improvement of properties by optimizing composition and processing.
In the past few years, breakthroughs from combining the MGI with data mining have emerged swiftly. The discovery of brand-new functional materials candidates, especially for clean energy storage, has been frequently reported in the journals Science and Nature, and the term materials code appeared in a cover article of the journal Nature.[21] High-throughput first-principles calculations (HTCs) make it possible to obtain massive volumes of data, making them the most abundant data resource for data mining and machine learning.[22] The Materials Project,[23] Automatic Flow for Materials Discovery (AFLOWLib), Open Quantum Materials Database (OQMD), Novel Materials Discovery (NoMaD) repository, CatApp Database, and Computational Materials Repository (CMR)[24] are newly established ab initio databases in which millions of data entries are integrated. By using the methods of principal component analysis, regression, neural networks, and Bayesian algorithms, materials with tailored properties have been discovered, such as Ti50.0Ni46.7Cu0.8Fe2.3Pd0.2.[25,26]
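Of the methods listed, regression is the simplest to illustrate. The sketch below fits a one-variable ordinary least squares model to a small training set and ranks unseen candidates by predicted property. It is not the procedure used in the cited studies: the descriptor, the property values, and the candidate names are synthetic and invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rank_candidates(candidates, slope, intercept):
    """Sort (name, descriptor) pairs by predicted property, highest first."""
    scored = [(name, slope * x + intercept) for name, x in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Synthetic training data (assumption: invented values, not measurements).
descriptor = [0.0, 1.0, 2.0, 3.0]
target = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(descriptor, target)

# Hypothetical candidates described by the same single descriptor.
candidates = [("candidate-A", 1.5), ("candidate-B", 4.0), ("candidate-C", 0.5)]
ranking = rank_candidates(candidates, slope, intercept)
print(ranking[0][0])  # → candidate-B
```

In practice the descriptors would come from a database such as those listed above, and a single-variable linear model would usually be replaced by a multivariate or nonlinear one.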
The integration of materials design and processing optimization[27] boosts research into solving the problems for the full work flow.[28,29] In this case, data range from the calculated elements to the detailed parameters in fabrication, and data mining extends the ideas of ICME to extract semantic connections, which are central to solving tough problems of integration, cleaning, and analysis among the attributes in experimentation and large-scale production.[10,30]
Materials data play a vital role in materials research. Industrial applications of materials data will be a positive stimulus for the systematic establishment and implementation of materials data science in research as well as education. Smart manufacturing aims to take advantage of advanced information and manufacturing technologies to enable flexibility in physical processes. Industry 4.0 therefore makes it possible to apply data to the whole work flow, and offers the opportunity to push materials data science forward into a knowledge engineering system that realizes artificial intelligence (AI) in materials innovation and production.
Materials data science, as a form of data science, is an interdisciplinary field which combines materials science with computer science and math, as well as physics and chemistry. Collaboration is urgently needed to move towards meeting core requirements and goals, one of which is to achieve the integration of materials theory and knowledge with the algorithms and methods of data mining and machine learning.
[1] | |
[2] | |
[3] | |
[4] | |
[5] | |
[6] | |
[7] | |
[8] | |
[9] | |
[10] | |
[11] | |
[12] | |
[13] | |
[14] | |
[15] | |
[16] | |
[17] | |
[18] | |
[19] | |
[20] | |
[21] | |
[22] | |
[23] | |
[24] | |
[25] | |
[26] | |
[27] | |
[28] | |
[29] | |
[30] |